
feat: incremental graph rebuild (3-PR sequence: G1+G2+G3)#284

Merged
HumanBean17 merged 9 commits into master from feat/incremental-graph
Jun 7, 2026
@HumanBean17 HumanBean17 commented Jun 7, 2026

Owner

Scope

Implements the full incremental graph rebuild plan from plans/active/PLAN-INCREMENTAL-GRAPH.md as a single PR containing three logically distinct layers:

PR-G1: Hash tracker + source_file edge schema

  • Adds source_file STRING column to all 12 edge table DDLs for file-scoped deletion
  • Implements FileHashTracker class (SHA-256, atomic save, change detection)
  • Bumps ONTOLOGY_VERSION from 16 → 17 (re-index required)
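To make the hash-tracker idea concrete, here is a minimal sketch of SHA-256 change detection with an atomic save. It is not the PR's `FileHashTracker`: the class name `FileHashTrackerSketch`, the JSON state file, and the method signatures are assumptions made for illustration.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


class FileHashTrackerSketch:
    """Hypothetical stand-in for FileHashTracker; not the PR's actual code."""

    def __init__(self, state_path: Path) -> None:
        self.state_path = Path(state_path)
        self.hashes: dict[str, str] = {}
        if self.state_path.exists():
            self.hashes = json.loads(self.state_path.read_text())

    @staticmethod
    def hash_file(path: Path) -> str:
        # Stream the file in chunks so large sources are not loaded at once.
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def detect_changes(self, files):
        """Return (added, changed, removed, current_hashes) vs. the saved state."""
        current = {str(p): self.hash_file(p) for p in files}
        added = sorted(set(current) - set(self.hashes))
        removed = sorted(set(self.hashes) - set(current))
        changed = sorted(
            p for p in current if p in self.hashes and current[p] != self.hashes[p]
        )
        return added, changed, removed, current

    def save(self, current) -> None:
        # Write to a temp file in the same directory, then rename over the
        # target, so a crash never leaves a half-written state file.
        fd, tmp = tempfile.mkstemp(dir=self.state_path.parent, suffix=".tmp")
        with os.fdopen(fd, "w") as fh:
            json.dump(current, fh)
        os.replace(tmp, self.state_path)
        self.hashes = dict(current)
```

In this sketch the caller saves only after the graph update succeeds, so a failed run is retried with the same change set.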

PR-G2: Incremental rebuild orchestrator

  • Adds incremental_rebuild() function with scoped pass 1–4 and global pass 5–6
  • Implements _load_existing_types(), _load_existing_members() for cross-file resolution
  • Implements _find_dependents() for single-hop dependent expansion (cap: 50 files)
  • Implements phase-based _delete_file_scope() (all edges before any nodes)
  • Implements _scoped_write() for writing into existing DB without schema drop
  • Crash safety via .graph_increment_in_progress marker file with fallback to full rebuild
  • Adds --incremental CLI flag to build_ast_graph.py
  • Adds run_incremental_graph() wrapper to pipeline.py
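The crash-safety bullet above can be pictured roughly as follows. This is a sketch under assumptions: only the marker file name comes from this PR; the function `rebuild_with_marker`, its callables, and the directory holding the marker are hypothetical.

```python
from pathlib import Path
from typing import Callable

MARKER_NAME = ".graph_increment_in_progress"  # name taken from the PR description


def rebuild_with_marker(
    state_dir: Path,
    incremental: Callable[[], None],
    full: Callable[[], None],
) -> str:
    """Run an incremental update guarded by a marker file (hypothetical helper)."""
    marker = Path(state_dir) / MARKER_NAME
    if marker.exists():
        # A previous incremental run did not finish, so the graph may be
        # half-updated: rebuild from scratch, then clear the marker.
        full()
        marker.unlink(missing_ok=True)
        return "full"
    marker.touch()
    incremental()  # if this raises, the marker is deliberately left behind
    marker.unlink()
    return "incremental"
```

The marker is removed only after the work completes, so any interruption, including a killed process, is detected on the next run.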

PR-G3: CLI integration

  • Updates _cmd_increment() to run incremental graph update after CocoIndex
  • Adds --vectors-only flag to preserve old Lance-only behavior
  • Removes stale _INCREMENT_WARNING_LINES / _emit_increment_kuzu_warning()
  • Updates README CLI cheat sheet and roadmap
  • Updates docs/JAVA-CODEBASE-RAG-CLI.md

Files changed

| File | PR | Changes |
|---|---|---|
| ast_java.py | G1 | ONTOLOGY_VERSION 16 → 17 |
| build_ast_graph.py | G1 | source_file STRING on all 12 edge DDLs + edge-write queries + FileHashTracker |
| build_ast_graph.py | G2 | incremental_rebuild(), _load_existing_*(), _find_dependents(), _delete_file_scope(), _scoped_write(), --incremental flag |
| java_codebase_rag/pipeline.py | G2/G3 | run_incremental_graph() wrapper |
| java_codebase_rag/cli.py | G3 | increment calls graph update, --vectors-only, removes stale warning |
| tests/test_incremental_graph.py | G1+G2 | 22 tests (9 G1 + 13 G2) |
| tests/test_java_codebase_rag_cli.py | G3 | Updated stale warning test + 5 new CLI tests |
| README.md | G3 | CLI cheat sheet + roadmap update |
| docs/JAVA-CODEBASE-RAG-CLI.md | G3 | increment command docs |

Manual Evidence

# All 22 incremental tests pass
.venv/bin/python -m pytest tests/test_incremental_graph.py -v
============================== 22 passed ==============================

# CLI tests pass
.venv/bin/python -m pytest tests/test_java_codebase_rag_cli.py -v
============================== 19 passed ==============================

# Lint checks pass
.venv/bin/ruff check build_ast_graph.py ast_java.py tests/test_incremental_graph.py java_codebase_rag/cli.py java_codebase_rag/pipeline.py README.md docs/JAVA-CODEBASE-RAG-CLI.md tests/test_java_codebase_rag_cli.py
All checks passed!

# Full test suite passes
.venv/bin/python -m pytest tests -v
============ 670 passed, 9 skipped, 0 failed ============

Design Notes

  • Phase-based deletion: _delete_file_scope deletes ALL edges across all scope files first, then deletes nodes. This prevents Kuzu errors when file A's nodes have incoming edges from file B that haven't been cleaned up yet.
  • Pass 5–6 always global: Client/producer extraction and cross-service matching iterate all members/routes — cheap in-memory operations that ensure consistency.
  • source_file semantics: Origin-side file only (e.g., for CALLS edges, it's the caller's filename). Dependent expansion covers target-side changes.
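The ordering in the phase-based deletion note can be sketched as a plan of statements: every edge table first, nodes last. The edge table names are the twelve listed in this PR's commits; the Cypher text, the `Symbol.filename` filter, and the helper name `plan_file_scope_deletion` are assumptions, not the code in `_delete_file_scope`.

```python
# The 12 edge tables named in this PR's commit messages.
EDGE_TABLES = [
    "EXTENDS", "IMPLEMENTS", "INJECTS", "DECLARES", "OVERRIDES", "CALLS",
    "UNRESOLVED_AT", "EXPOSES", "DECLARES_CLIENT", "DECLARES_PRODUCER",
    "HTTP_CALLS", "ASYNC_CALLS",
]


def plan_file_scope_deletion(files):
    """Return (query, params) pairs: all edge deletions, then node deletions.

    Hypothetical helper; the query strings are illustrative Cypher.
    """
    params = {"files": sorted(files)}
    plan = []
    # Phase 1: remove edges that originate in any scope file, across ALL
    # edge tables, before touching a single node.
    for label in EDGE_TABLES:
        plan.append((
            f"MATCH ()-[e:{label}]->() WHERE e.source_file IN $files DELETE e",
            params,
        ))
    # Phase 2: only now delete the nodes defined in those files.
    plan.append(("MATCH (s:Symbol) WHERE s.filename IN $files DELETE s", params))
    return plan


# Usage with a Kuzu connection would look like:
#     for query, params in plan_file_scope_deletion(scope_files):
#         conn.execute(query, params)
```

Because source_file records only the origin side, this ordering is complete only when dependents of the changed files are already part of the scope, which is what the dependent expansion provides.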

Reindex Required

Existing installations must run java-codebase-rag reprocess once after upgrading to add the source_file column to edge tables. This is a one-time migration triggered by the ONTOLOGY_VERSION bump from 16 to 17.

🤖 Generated with Claude Code

HumanBean17 and others added 5 commits June 7, 2026 15:42
…ntal graph rebuild (PR-G1)

This commit implements PR-G1 of the incremental graph rebuild plan:
- Bump ONTOLOGY_VERSION to 17 (requires re-index)
- Add source_file STRING column to all 12 edge DDL constants
- Update _write_edges() to pass source_file for EXTENDS, IMPLEMENTS, INJECTS, DECLARES, OVERRIDES, CALLS, UNRESOLVED_AT
- Update _write_routes_and_exposes() to pass source_file for EXPOSES, DECLARES_CLIENT, DECLARES_PRODUCER, HTTP_CALLS, ASYNC_CALLS
- Add FileHashTracker class for detecting file changes (added, changed, removed)
- Add 9 tests for FileHashTracker and edge schema validation

Scope: PR-G1 (Hash tracker + source_file edge schema)
Plan: plans/active/PLAN-INCREMENTAL-GRAPH.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tests that verify EXTENDS edge dependent expansion were missing pass2_edges()
calls in their setup, resulting in no EXTENDS edges being written to the graph.
Also fixed crash marker not being cleaned up in the _fallback_to_full code path
and invalid Kuzu SHOW_TABLES syntax in test_incremental_pass5_6_always_global.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Process edge deletions across all scope files before deleting any nodes.
The previous per-file loop could fail when file B has an EXTENDS edge
to file A — deleting A's nodes first left dangling edges that Kuzu
refused to drop.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-G2)

Add subprocess wrapper that passes --incremental flag to build_ast_graph.py.
Part of incremental graph rebuild implementation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Update increment command to run both CocoIndex catch-up and incremental Kuzu graph update
- Add --vectors-only flag to preserve old Lance-only behavior
- Update CLI help texts and documentation
- Emit JSON output from incremental_rebuild for mode detection
- Add/update tests for new increment behavior

Increment now updates both Lance and Kuzu by default. The old stale
warning is only emitted when --vectors-only is used.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@HumanBean17 HumanBean17 changed the title from "feat: add source_file to edge schemas and FileHashTracker for incremental graph rebuild (PR-G1)" to "feat: incremental graph rebuild (3-PR sequence: G1+G2+G3)" Jun 7, 2026
HumanBean17 and others added 2 commits June 7, 2026 18:06
…mental rebuild

- Fix _write_clients_producers_and_calls: use correct parameter names
  ($sid/$cid/$pid/$rid) matching Cypher templates, add missing fields
  (strategy, method_call, raw_uri, match, direction, raw_topic)
- Use dict lookup instead of O(n) list scan for client/producer source_file
- Use keyword args for MemberEntry placeholder construction
- Delete db+conn before fallback to avoid file lock
- Remove redundant import json inside main()
- Remove stale duplicate comment in pass1_parse

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Fix _write_clients_producers_and_calls: use asdict(row) for Client/Producer
  nodes instead of manually constructed dicts with wrong field names
  (kind vs client_kind, target vs target_service, missing 10+ fields)
- Fix _delete_file_scope: add ALL edge tables to Phase 1 deletion
  (was missing EXPOSES, DECLARES_CLIENT, DECLARES_PRODUCER, HTTP_CALLS,
  ASYNC_CALLS — would crash on any Spring codebase with controllers)
- Use DETACH DELETE for Route/Client/Producer nodes as safety net
- Fix N+1 query in dependent expansion: single IN-query instead of per-file

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@HumanBean17

Owner Author

PR #284 Review: Incremental Graph Rebuild (G1+G2+G3)

+2003 / -67 lines across 8 files — architecture is sound, but there's a critical runtime bug that must be fixed before merge.


Critical Bug: _write_clients_producers_and_calls will crash at runtime

This function has multiple issues that would cause failures on any real codebase with routes/clients/producers:

1. Wrong parameter names: The Cypher templates use $sid/$cid/$pid/$rid for MATCH node lookups, but the function passes parameters keyed "src"/"dst":

# _CREATE_DECLARES_CLIENT expects: MATCH (s:Symbol {id: $sid}), (c:Client {id: $cid})
conn.execute(_CREATE_DECLARES_CLIENT, {
    "src": row.symbol_id,   # BUG: should be "sid"
    "dst": row.client_id,   # BUG: should be "cid"
    ...
})

Same mismatch for DECLARES_PRODUCER ($sid/$pid vs. "src"/"dst"), HTTP_CALLS ($cid/$rid vs. "src"/"dst"), and ASYNC_CALLS ($pid/$rid vs. "src"/"dst").

2. Missing parameters — All four edge-writing loops omit required template parameters:

| Edge type | Missing params |
|---|---|
| DECLARES_CLIENT | strategy |
| DECLARES_PRODUCER | strategy |
| HTTP_CALLS | strategy, method_call, raw_uri, match |
| ASYNC_CALLS | strategy, direction, raw_topic, match |

3. O(n²) lookup — Source file resolution for HTTP_CALLS/ASYNC_CALLS builds a new list and does .index() for every row:

tables.client_rows[[c.id for c in tables.client_rows].index(row.client_id)].filename

Should use a dict lookup like the existing _write_routes_and_exposes does.
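For illustration, the dict-based lookup could look like the sketch below. `ClientRow` and `HttpCallRow` are simplified, hypothetical stand-ins for the real row types, and `source_files_for_calls` is not a function in the PR.

```python
from dataclasses import dataclass


@dataclass
class ClientRow:  # hypothetical; the real row carries more fields
    id: str
    filename: str


@dataclass
class HttpCallRow:  # hypothetical
    client_id: str
    route_id: str


def source_files_for_calls(client_rows, call_rows):
    """Resolve each call's source file with one dict build plus O(1) lookups."""
    # Build the index once instead of rebuilding a list and calling
    # .index() for every row.
    file_by_client_id = {c.id: c.filename for c in client_rows}
    return [file_by_client_id.get(row.client_id, "") for row in call_rows]
```

Using `.get()` with a default also avoids the `ValueError` that `.index()` raises when a client id is missing; whether an empty string is the right fallback is an open choice.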

Why tests don't catch this: The minimal test fixtures (class A {}, class B extends A {}) have no routes/clients/producers, so these loops are never entered. A test with Spring-annotated classes would surface the crash.
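A cheap guard against issues 1 and 2, independent of fixtures, is to compare the `$name` placeholders in a template with the keys actually passed. The sketch below is a hypothetical helper; the template string is an assumed shape modeled on the comment in the snippet above, not the real `_CREATE_DECLARES_CLIENT`.

```python
import re

_PLACEHOLDER = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def check_params(template: str, params: dict):
    """Return (missing, unexpected) parameter names for a Cypher template."""
    expected = set(_PLACEHOLDER.findall(template))
    missing = sorted(expected - set(params))
    unexpected = sorted(set(params) - expected)
    return missing, unexpected


# Assumed template shape, for illustration only.
EXAMPLE_TEMPLATE = (
    "MATCH (s:Symbol {id: $sid}), (c:Client {id: $cid}) "
    "CREATE (s)-[:DECLARES_CLIENT "
    "{strategy: $strategy, source_file: $source_file}]->(c)"
)

missing, unexpected = check_params(
    EXAMPLE_TEMPLATE, {"src": "m1", "dst": "c1", "source_file": "A.java"}
)
print(missing)     # ['cid', 'sid', 'strategy']
print(unexpected)  # ['dst', 'src']
```

The regex is deliberately naive: a `$` inside a Cypher string literal would be reported as a placeholder, so this suits a unit test over known templates rather than general query parsing.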


Significant Issues (should fix)

4. Duplicated _load_existing_types / _load_existing_types_filtered — These two functions (and their _members counterparts) are ~50 lines each and differ only by the AND NOT (s.filename IN $exclude_files) clause. Extract common logic into a single function with an optional exclude_files param.

5. Repetitive _find_dependents — Six nearly identical if/elif branches that only differ by the edge label name. Loop over a list of edge type strings instead.

6. _write_nodes_merge duplicates _write_nodes — These ~70-line functions differ only by using _MERGE_SYMBOL vs _CREATE_SYMBOL. Factor to a shared helper that accepts the query template as a parameter.

7. _file_by_node_id built twice independently — Once in _write_edges and once in _write_routes_and_exposes. Compute once and share.
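Item 5 could be addressed along these lines. This is a sketch, not the PR's `_find_dependents`: the edge label list, the query text, and the injected `run_query` callable are assumptions, and what happens once the 50-file cap is exceeded is left to the caller because the PR text does not spell it out.

```python
from typing import Callable, Iterable

# Hypothetical choice of the six dependency-bearing edge labels.
DEPENDENCY_EDGES = ["EXTENDS", "IMPLEMENTS", "INJECTS", "OVERRIDES", "CALLS", "DECLARES"]
MAX_DEPENDENTS = 50  # cap stated in the PR description


def find_dependents(
    run_query: Callable[[str, dict], Iterable[str]],
    changed_files: Iterable[str],
    edge_labels=DEPENDENCY_EDGES,
    cap: int = MAX_DEPENDENTS,
):
    """Single-hop expansion: files with an edge pointing into a changed file.

    Returns (sorted dependents, whether the cap was exceeded).
    """
    changed = set(changed_files)
    params = {"files": sorted(changed)}
    dependents = set()
    # One loop over label names replaces six near-identical branches, and
    # one IN-query per label replaces a query per file.
    for label in edge_labels:
        query = (
            f"MATCH (a:Symbol)-[e:{label}]->(b:Symbol) "
            "WHERE b.filename IN $files RETURN DISTINCT e.source_file"
        )
        for source_file in run_query(query, params):
            if source_file not in changed:
                dependents.add(source_file)
    return sorted(dependents), len(dependents) > cap
```

Passing `run_query` in keeps the function testable without a database; in real use it would wrap the connection's execute call and yield the first column of each result row.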


Minor Issues (nice to fix)

8. Duplicate comment in pass1_parse — "Skip files not in scope" appears twice.

9. import json inside main() and _cmd_increment — should be top-level imports.

10. FileHashTracker.save() silently swallows OSError — consider logging a warning so it's discoverable when writes fail repeatedly.

11. _fallback_to_full duplicates hash-init logic from the no-DB branch of incremental_rebuild — extract to a shared helper.

12. AGENTS.md says ontology_version is 15 but is now 17 — worth updating.


Test Coverage Gap

No test with Spring annotations (routes, clients, producers) that exercises the pass 5-6 global path with actual Client/Producer/HTTP_CALLS/ASYNC_CALLS edges. This is why the critical bug above slipped through. Adding one test against the bank-chat-system fixture (which has Spring controllers) would catch it.


Summary

The design — phase-based deletion, single-hop dependent expansion with cap, crash marker, automatic fallback — is well-considered. The critical bug in _write_clients_producers_and_calls (wrong param names + missing params) must be fixed before merge. The duplication issues are worth addressing but not blocking.

HumanBean17 and others added 2 commits June 7, 2026 21:48
- Merge _load_existing_types/_load_existing_types_filtered into single
  function with optional exclude_files parameter (same for members)
- Simplify _find_dependents: loop over edge type strings instead of
  six identical if/elif branches
- Factor _write_nodes and _write_nodes_merge to shared _write_nodes_impl
  accepting the query template as parameter (~70 lines deduplicated)
- Extract _build_file_by_node_id and share between _write_edges and
  _write_routes_and_exposes (was built twice independently)
- Extract _init_hash_tracker helper for duplicated hash-init logic
  in _fallback_to_full and no-DB branch of incremental_rebuild
- Add warning log in FileHashTracker.save() instead of silent OSError
- Update AGENTS.md ontology_version references from 15 to 17

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…eview round 2

- Write phantom Route nodes in _write_clients_producers_and_calls using MERGE
  (pass5 creates phantom routes for cross-service calls that were never
  persisted to Kuzu, silently dropping HTTP_CALLS/ASYNC_CALLS edges)
- Remove redundant _load_existing_members before full pass1_parse in global
  pass 5-6 step (was creating duplicate stub members alongside full entries)
- Use conn.close() + del instead of bare del for Kuzu handle cleanup on
  fallback (avoids relying on CPython ref-counting for file lock release)
- Add FileNotFoundError handling in FileHashTracker.detect_changes for files
  that vanish between listing and hashing
- Remove redundant inline import json in cli.py _cmd_increment
- Update stale _delete_file_scope docstring to match implementation

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@HumanBean17 HumanBean17 merged commit 67a76de into master Jun 7, 2026
1 check passed